In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.
This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data?select=Car+details+v3.csv.
This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.
Our analysis is driven by three core objectives:
Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.
By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.
The dataset contains the following attributes:
| Attribute Name | Description | |
|---|---|---|
| 1 | name |
The make and model of the vehicle (e.g., Hyundai i10, Honda City). |
| 2 | year |
The year the vehicle was manufactured. |
| 3 | selling_price |
The selling price of the vehicle in Indian Rupees (INR). |
| 4 | km_driven |
The total distance the vehicle has been driven (in km). |
| 5 | fuel |
The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). |
| 6 | seller_type |
The type of seller (“Individual”, “Dealer” or “Trustmark Dealer”). |
| 7 | transmission |
The type of transmission system (“Manual” or “Automatic”). |
| 8 | owner |
The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). |
| 9 | mileage |
The fuel efficiency of the vehicle (in either km/l or km/kg). |
| 10 | engine |
The engine displacement capacity (in CC). |
| 11 | max_power |
The maximum power output of the vehicle’s engine, measured in brake horsepower (bhp). |
| 12 | torque |
A pair of torque and RPM values representing the max torque of the vehicle. The torque values are in either Nm or kgm, and the RPM values are either values or value ranges. |
| 13 | seats |
The seating capacity of the vehicle as an integer. |
The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.
| Attribute Type | Attribute Name |
|---|---|
| Categorical Variables | name, fuel, seller_type,
transmission, owner |
| Discrete Numeric Variables | year, km_driven, seats |
| Continuous Numeric Variables | selling_price, mileage,
engine, max_power, torque |
# Set document-wide options
options(tibble.width = Inf) # Make tibbles display all columns
car_details = read_csv("Car details v3.csv")
nrow(car_details)
## [1] 8128
head(car_details)
## # A tibble: 6 × 13
## name year selling_price km_driven fuel seller_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Maruti Swift Dzire VDI 2014 450000 145500 Diesel Individual
## 2 Skoda Rapid 1.5 TDI Ambition 2014 370000 120000 Diesel Individual
## 3 Honda City 2017-2020 EXi 2006 158000 140000 Petrol Individual
## 4 Hyundai i20 Sportz Diesel 2010 225000 127000 Diesel Individual
## 5 Maruti Swift VXI BSIII 2007 130000 120000 Petrol Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017 440000 45000 Petrol Individual
## transmission owner mileage engine max_power
## <chr> <chr> <chr> <chr> <chr>
## 1 Manual First Owner 23.4 kmpl 1248 CC 74 bhp
## 2 Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual Third Owner 17.7 kmpl 1497 CC 78 bhp
## 4 Manual First Owner 23.0 kmpl 1396 CC 90 bhp
## 5 Manual First Owner 16.1 kmpl 1298 CC 88.2 bhp
## 6 Manual First Owner 20.14 kmpl 1197 CC 81.86 bhp
## torque seats
## <chr> <dbl>
## 1 190Nm@ 2000rpm 5
## 2 250Nm@ 1500-2500rpm 5
## 3 12.7@ 2,700(kgm@ rpm) 5
## 4 22.4 kgm at 1750-2750rpm 5
## 5 11.5@ 4,500(kgm@ rpm) 5
## 6 113.75nm@ 4000rpm 5
car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906
sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717
name variable# Extract the first word from "name" to get the make of the vehicle
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)
mileage variable# View raw data for "mileage" variable
head(car_details$mileage, n=20)
## [1] "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" "16.1 kmpl"
## [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl" "23.59 kmpl" "20.0 kmpl"
## [11] "19.01 kmpl" "17.3 kmpl" "19.3 kmpl" "18.9 kmpl" "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl" "19.7 kmpl" "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage"
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)
# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"
# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
## [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50
Above, the mileage variable is renamed to
fuel_efficiency to prevent confusion with the
km_driven variable, which represents the quantity normally
referred to as vehicle “mileage”.
engine variable# View raw data for "engine" variable
head(car_details$engine, n=20)
## [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC"
## [8] "1364 CC" "1399 CC" "1461 CC" "993 CC" "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC" "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
## [1] 1248 1498 1497 1396 1298 1197 796 1364 1399 1461 993 1248 1061 1198 1248
## [16] 1396 796 1396 1461 1498
max_power variable# Transform "max_power" variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
## [1] "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" "88.2 bhp"
## [6] "81.86 bhp" "37 bhp" "67.1 bhp" "68.1 bhp" "108.45 bhp"
## [11] "60 bhp" "73.9 bhp" "67 bhp" "82 bhp" "88.5 bhp"
## [16] "90 bhp" "46.3 bhp" "88.73 bhp" "64.1 bhp" "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
## [1] 74.00 103.52 78.00 90.00 88.20 81.86 37.00 67.10 68.10 108.45
## [11] 60.00 73.90 67.00 82.00 88.50 90.00 46.30 88.73 64.10 98.60
# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
km_driven to
km_driven_in_10k# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k"
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
## km_driven_in_10k
## <dbl>
## 1 14.6
## 2 12
## 3 14
## 4 12.7
## 5 12
## 6 4.5
selling_price to
selling_price_in_10k# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
## selling_price_in_10k
## <dbl>
## 1 45
## 2 37
## 3 15.8
## 4 22.5
## 5 13
## 6 44
torque column# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]
make_category variable from make
variableunique(car_details$make)
## [1] Maruti Skoda Honda Hyundai Toyota
## [6] Ford Renault Mahindra Tata Chevrolet
## [11] Datsun Jeep Mercedes-Benz Mitsubishi Audi
## [16] Volkswagen BMW Nissan Lexus Jaguar
## [21] Land MG Volvo Daewoo Kia
## [26] Fiat Force Ambassador Ashok Isuzu
## [31] Opel
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
# Create new column "make_category" from "make" column
car_details$make_category = ifelse(car_details$make %in%
c("Ambassador", "Ashok", "Daewoo", "Datsun", "Opel", "Fiat"),
"Budget",
ifelse(car_details$make %in%
c("Chevrolet", "Maruti", "Renault", "Mitsubishi", "Ford",
"Honda", "Hyundai", "Isuzu", "Kia", "Toyota", "Force",
"Volkswagen", "Tata", "Skoda", "Jeep", "MG", "Nissan", "Mahindra"),
"Midrange",
"Luxury"
)
)
# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)
# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget" "Luxury" "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)
## make make_category
## Maruti Midrange
## Skoda Midrange
## Honda Midrange
## Hyundai Midrange
## Toyota Midrange
## Ford Midrange
## Renault Midrange
## Mahindra Midrange
## Tata Midrange
## Chevrolet Midrange
## Datsun Budget
## Jeep Midrange
## Mercedes-Benz Luxury
## Mitsubishi Midrange
## Audi Luxury
## Volkswagen Midrange
## BMW Luxury
## Nissan Midrange
## Lexus Luxury
## Jaguar Luxury
## Land Luxury
## MG Midrange
## Volvo Luxury
## Daewoo Budget
## Kia Midrange
## Fiat Budget
## Force Midrange
## Ambassador Budget
## Ashok Budget
## Isuzu Midrange
## Opel Budget
head(car_details)
## # A tibble: 6 × 13
## make year selling_price_in_10k km_driven_in_10k fuel seller_type
## <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Maruti 2014 45 14.6 Diesel Individual
## 2 Skoda 2014 37 12 Diesel Individual
## 3 Honda 2006 15.8 14 Petrol Individual
## 4 Hyundai 2010 22.5 12.7 Diesel Individual
## 5 Maruti 2007 13 12 Petrol Individual
## 6 Hyundai 2017 44 4.5 Petrol Individual
## transmission owner fuel_efficiency engine max_power seats make_category
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Manual First Owner 23.4 1248 74 5 Midrange
## 2 Manual Second Owner 21.1 1498 104. 5 Midrange
## 3 Manual Third Owner 17.7 1497 78 5 Midrange
## 4 Manual First Owner 23 1396 90 5 Midrange
## 5 Manual First Owner 16.1 1298 88.2 5 Midrange
## 6 Manual First Owner 20.1 1197 81.9 5 Midrange
car_details datasetstructure_car_details = data.frame(
Column = names(car_details),
Type = unname(sapply(car_details, class)))
print(structure_car_details)
## Column Type
## 1 make factor
## 2 year numeric
## 3 selling_price_in_10k numeric
## 4 km_driven_in_10k numeric
## 5 fuel factor
## 6 seller_type factor
## 7 transmission factor
## 8 owner factor
## 9 fuel_efficiency numeric
## 10 engine numeric
## 11 max_power numeric
## 12 seats numeric
## 13 make_category factor
car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
selling_price = 1000.# Histogram of selling prices
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
main="Selling Price distribution",
breaks = 30,
col = "lightblue")
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)
par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
xlab = "Number of Cars",
ylab = "Car Make",
col = "lightblue",
cex.names = 0.5,
main = "Num. of Cars in Each Make")
# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")
# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")
# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")
# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
las = 2,cex.axis = 0.7,
main = "Boxplot of Selling Price by Car Make",
xlab = "Car Make",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
col = rainbow(length(unique(car_details$fuel))),
main = "Box Plot of Selling Price by Fuel Type",
xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
col = rainbow(length(unique(car_details$seller_type))),
main = "Box Plot of Selling Price by Seller Type",
xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
col = rainbow(length(unique(car_details$transmission))),
main = "Boxplot of Selling Price by Transmission Type",
xlab = "Transmission Type",
ylab = "Selling Price (10,000 INR)")
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
col = rainbow(length(unique(car_details$owner))),
main = "Boxplot of Selling Price vs Owner Type",
xlab = "Owner Type",
ylab = "Selling Price (10,000 INR)")
# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
main = "Selling Price vs Year and Transmission Type",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
col = c("blue", "red"), pch = 19)
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
col = colours[car_details$make_category],pch=19,
main = "Selling Price vs Year and Make Category",
xlab = "Year",
ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
col = c("blue", "green","orange"), pch = 19)
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven")
# Subset out extreme values
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)
# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="Km Driven (10,000 km)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Km Driven (extreme values removed)")
# Subset out 15 observations with fuel_efficiency=0
car_details = subset(car_details, car_details$fuel_efficiency!=0,)
# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
xlab="Fuel Efficiency (km/l)",
ylab="Selling Price (10,000 INR)",
main="Selling Price vs Fuel Efficiency")
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel"="green")
plot(selling_price_in_10k ~ max_power, data=car_details,
col=colours[car_details$fuel],
main="Selling Price vs Max Power (by Fuel Type)",
xlab="Max Power (bhp)",
ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
col = c("blue", "green"), pch = 19)
# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year+km_driven_in_10k+fuel_efficiency+engine+max_power+seats ,data=car_details)
# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[,sapply(car_details,is.numeric)])
cor_mat
## year selling_price_in_10k km_driven_in_10k
## year 1.00000000 0.4395767 -0.45496390
## selling_price_in_10k 0.43957666 1.0000000 -0.19812586
## km_driven_in_10k -0.45496390 -0.1981259 1.00000000
## fuel_efficiency 0.37352047 -0.1258892 -0.24186898
## engine -0.02333154 0.4480476 0.30860425
## max_power 0.15993052 0.6857687 0.04886655
## seats 0.02630866 0.1648269 0.24955889
## fuel_efficiency engine max_power seats
## year 0.3735205 -0.02333154 0.15993052 0.02630866
## selling_price_in_10k -0.1258892 0.44804759 0.68576871 0.16482693
## km_driven_in_10k -0.2418690 0.30860425 0.04886655 0.24955889
## fuel_efficiency 1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine -0.5990946 1.00000000 0.68399138 0.66312580
## max_power -0.4007856 0.68399138 1.00000000 0.25911446
## seats -0.4855295 0.66312580 0.25911446 1.00000000
# Extract high correlation values
high_cor = cor_mat[cor_mat > 0.5 & cor_mat != 1]
high_cor_indices = which(cor_mat > 0.5 & cor_mat != 1, arr.ind = TRUE)
high_cor_df = data.frame(
row = rownames(cor_mat)[high_cor_indices[, 1]],
column = colnames(cor_mat)[high_cor_indices[, 2]],
correlation = high_cor)
# Display first few pairs of high-correlation variables
head(high_cor_df, 6)
## row column correlation
## 1 max_power selling_price_in_10k 0.6857687
## 2 max_power engine 0.6839914
## 3 seats engine 0.6631258
## 4 selling_price_in_10k max_power 0.6857687
## 5 engine max_power 0.6839914
## 6 engine seats 0.6631258
In this section, a model for predicting values of
selling_price based on the values of the other variables is
developed and the performance of the model is analyzed.
Logarithmic transformation can make the positively skewed distribution of selling price to be normally distributed and makes it more suitable for model building
car_details$selling_price_in_10k = log(car_details$selling_price_in_10k)
set.seed(125) # For reproducibility
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323
make)# Definition of "full" linear model with BIC-based stepwise model selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model = step(lm(selling_price_in_10k ~ .-make, data = train_data),direction="both",k=log(n),trace=0)
# Model summary
summary(bic_full_add_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ (make + year + km_driven_in_10k +
## fuel + seller_type + transmission + owner + fuel_efficiency +
## engine + max_power + seats + make_category) - make, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68520 -0.16717 0.02508 0.19042 1.42921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.217e+02 3.054e+00 -72.603 < 2e-16 ***
## year 1.112e-01 1.530e-03 72.642 < 2e-16 ***
## km_driven_in_10k -3.850e-03 1.033e-03 -3.728 0.000195 ***
## fuelPetrol -1.369e-01 1.296e-02 -10.569 < 2e-16 ***
## seller_typeIndividual -7.010e-02 1.401e-02 -5.003 5.82e-07 ***
## seller_typeTrustmark Dealer 8.294e-02 5.984e-02 1.386 0.165779
## transmissionManual -9.709e-02 1.692e-02 -5.737 1.02e-08 ***
## ownerFourth & Above Owner -1.212e-01 2.880e-02 -4.209 2.61e-05 ***
## ownerSecond Owner -7.404e-02 9.991e-03 -7.411 1.45e-13 ***
## ownerTest Drive Car 6.114e-01 1.317e-01 4.641 3.55e-06 ***
## ownerThird Owner -1.087e-01 1.663e-02 -6.536 6.90e-11 ***
## fuel_efficiency 1.214e-02 1.956e-03 6.207 5.83e-10 ***
## engine 2.566e-04 1.942e-05 13.213 < 2e-16 ***
## max_power 8.500e-03 2.201e-04 38.616 < 2e-16 ***
## seats 4.647e-02 6.311e-03 7.362 2.09e-13 ***
## make_categoryLuxury 5.779e-01 4.906e-02 11.778 < 2e-16 ***
## make_categoryMidrange 2.194e-01 3.336e-02 6.576 5.28e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2909 on 5272 degrees of freedom
## Multiple R-squared: 0.8474, Adjusted R-squared: 0.8469
## F-statistic: 1830 on 16 and 5272 DF, p-value: < 2.2e-16
#Number of parameters
length(coef(bic_full_add_model))
## [1] 17
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model,which=c(1,2))
# Diagnostic tests
bp_test = bptest(bic_full_add_model)
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: bic_full_add_model
## BP = 372.41, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residual_subset
## W = 0.98173, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.190700 1 1.480101
## km_driven_in_10k 1.634226 1 1.278369
## fuel 2.588377 1 1.608843
## seller_type 1.112922 2 1.027108
## transmission 1.446943 1 1.202889
## owner 1.412292 4 1.044096
## fuel_efficiency 3.683989 1 1.919372
## engine 5.759182 1 2.399830
## max_power 3.026818 1 1.739775
## seats 2.419890 1 1.555599
## make_category 1.562362 2 1.118009
| Adjusted R^2 | Number of parameters | Breusch-Pagan test | Shapiro-Wilk normality test (with 5000 sample residuals) |
VIF>5 | Non significant predictors(pvalue > 0.05) | |
|---|---|---|---|---|---|---|
| Full additive model with BIC (Excluding “make” predictor) |
0.8469423 | 17 | Constant variance assumption failed with BP = 372.41 |
Normality assumption f ailed with W = 0.98173 |
None | seller_typeTrustmark Dealer |
The “full” linear model has a high \(R^2\) value of 0.8469423, indicating that
84.69% of the variance in selling_price_in_10k is explained
by the predictors.
Most predictors are statistically significant (p-value <
0.05), suggesting they contribute meaningfully to the model. Except
seller_typeTrustmark Dealer (p = 0.165779)
Diagnostic testing for this model:
The Breusch-Pagan test yielded a very high test statistic of 372.41 and a p-value less than \(2.2*10^{-16}\) which rejects the null hypothesis that constant variance exists in residuals
The Shapiro-Wilk test with 5000 sample residuals yielded a low p-value less than \(2.2*10^{-16}\), which rejects the null hypothesis that the residuals are normally distributed.
small_model = lm(selling_price_in_10k ~ make_category*year*max_power + year*transmission + owner + seller_type + km_driven_in_10k+fuel_efficiency, data = train_data)
summary(small_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ make_category * year * max_power +
## year * transmission + owner + seller_type + km_driven_in_10k +
## fuel_efficiency, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.56207 -0.19305 -0.00193 0.20786 1.57063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.817e+01 8.581e+01 0.445 0.656439
## make_categoryLuxury -3.231e+02 1.393e+02 -2.320 0.020379 *
## make_categoryMidrange -2.221e+02 8.540e+01 -2.601 0.009323 **
## year -1.792e-02 4.268e-02 -0.420 0.674557
## max_power -3.539e+00 1.410e+00 -2.509 0.012130 *
## transmissionManual 2.875e+00 1.071e+01 0.268 0.788462
## ownerFourth & Above Owner -1.059e-01 3.143e-02 -3.370 0.000756 ***
## ownerSecond Owner -7.255e-02 1.092e-02 -6.644 3.36e-11 ***
## ownerTest Drive Car 3.003e-01 1.460e-01 2.057 0.039756 *
## ownerThird Owner -1.128e-01 1.817e-02 -6.206 5.87e-10 ***
## seller_typeIndividual -5.130e-02 1.539e-02 -3.334 0.000861 ***
## seller_typeTrustmark Dealer 1.391e-02 6.520e-02 0.213 0.831115
## km_driven_in_10k 1.176e-02 1.040e-03 11.307 < 2e-16 ***
## fuel_efficiency -7.519e-04 1.411e-03 -0.533 0.594237
## make_categoryLuxury:year 1.615e-01 6.919e-02 2.335 0.019587 *
## make_categoryMidrange:year 1.106e-01 4.247e-02 2.603 0.009257 **
## make_categoryLuxury:max_power 3.372e+00 1.516e+00 2.225 0.026152 *
## make_categoryMidrange:max_power 2.756e+00 1.413e+00 1.951 0.051129 .
## year:max_power 1.766e-03 7.014e-04 2.518 0.011843 *
## year:transmissionManual -1.432e-03 5.316e-03 -0.269 0.787615
## make_categoryLuxury:year:max_power -1.680e-03 7.535e-04 -2.230 0.025805 *
## make_categoryMidrange:year:max_power -1.371e-03 7.025e-04 -1.951 0.051063 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3174 on 5267 degrees of freedom
## Multiple R-squared: 0.8185, Adjusted R-squared: 0.8178
## F-statistic: 1131 on 21 and 5267 DF, p-value: < 2.2e-16
plot(small_model,which=c(1,2))
vif(small_model,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df))
## make_category 44.883224 11 1.188761
## year 2.794605 13 1.040318
## max_power 44.883224 11 1.188761
## transmission 2853.840852 3 3.766214
## owner 1.465339 4 1.048920
## seller_type 1.123246 2 1.029482
## km_driven_in_10k 1.391267 1 1.179520
## fuel_efficiency 1.610527 1 1.269065
## Interacts With
## make_category year, max_power
## year make_category, max_power, transmission
## max_power make_category, year
## transmission year
## owner --
## seller_type --
## km_driven_in_10k --
## fuel_efficiency --
## Other Predictors
## make_category transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## year owner, seller_type, km_driven_in_10k, fuel_efficiency
## max_power transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## transmission make_category, max_power, owner, seller_type, km_driven_in_10k, fuel_efficiency
## owner make_category, year, max_power, transmission, seller_type, km_driven_in_10k, fuel_efficiency
## seller_type make_category, year, max_power, transmission, owner, km_driven_in_10k, fuel_efficiency
## km_driven_in_10k make_category, year, max_power, transmission, owner, seller_type, fuel_efficiency
## fuel_efficiency make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
n= nrow(train_data)
bic_small_model = step(small_model,direction="both",k=log(n),trace=0)
summary(bic_small_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ make_category + year + max_power +
## owner + km_driven_in_10k + make_category:max_power + year:max_power,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5230 -0.1936 0.0028 0.2063 1.5677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.783e+02 6.103e+00 -29.210 < 2e-16 ***
## make_categoryLuxury 2.052e+00 2.689e-01 7.633 2.71e-14 ***
## make_categoryMidrange 3.321e-01 2.054e-01 1.617 0.106016
## year 8.963e-02 3.031e-03 29.573 < 2e-16 ***
## max_power -7.988e-01 7.007e-02 -11.400 < 2e-16 ***
## ownerFourth & Above Owner -1.139e-01 3.143e-02 -3.624 0.000293 ***
## ownerSecond Owner -7.651e-02 1.089e-02 -7.028 2.35e-12 ***
## ownerTest Drive Car 3.162e-01 1.443e-01 2.191 0.028531 *
## ownerThird Owner -1.189e-01 1.812e-02 -6.562 5.83e-11 ***
## km_driven_in_10k 1.132e-02 1.029e-03 11.008 < 2e-16 ***
## make_categoryLuxury:max_power -8.803e-03 2.985e-03 -2.949 0.003202 **
## make_categoryMidrange:max_power -1.312e-03 2.857e-03 -0.459 0.646134
## year:max_power 4.035e-04 3.479e-05 11.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.318 on 5276 degrees of freedom
## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8171
## F-statistic: 1969 on 12 and 5276 DF, p-value: < 2.2e-16
plot(bic_small_model,which=c(1,2))
vif(small_model,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df))
## make_category 44.883224 11 1.188761
## year 2.794605 13 1.040318
## max_power 44.883224 11 1.188761
## transmission 2853.840852 3 3.766214
## owner 1.465339 4 1.048920
## seller_type 1.123246 2 1.029482
## km_driven_in_10k 1.391267 1 1.179520
## fuel_efficiency 1.610527 1 1.269065
## Interacts With
## make_category year, max_power
## year make_category, max_power, transmission
## max_power make_category, year
## transmission year
## owner --
## seller_type --
## km_driven_in_10k --
## fuel_efficiency --
## Other Predictors
## make_category transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## year owner, seller_type, km_driven_in_10k, fuel_efficiency
## max_power transmission, owner, seller_type, km_driven_in_10k, fuel_efficiency
## transmission make_category, max_power, owner, seller_type, km_driven_in_10k, fuel_efficiency
## owner make_category, year, max_power, transmission, seller_type, km_driven_in_10k, fuel_efficiency
## seller_type make_category, year, max_power, transmission, owner, km_driven_in_10k, fuel_efficiency
## km_driven_in_10k make_category, year, max_power, transmission, owner, seller_type, fuel_efficiency
## fuel_efficiency make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
bptest(bic_small_model)
##
## studentized Breusch-Pagan test
##
## data: bic_small_model
## BP = 360.31, df = 12, p-value < 2.2e-16
indicies_to_exclude = unname(which(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model))))
sum(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model)))
## [1] 287
train_data_filtered = train_data[-indicies_to_exclude,]
n= nrow(train_data_filtered)
refitted_wip = lm(log(selling_price_in_10k) ~ .+ make_category*year+ make_category*max_power -make-seats-engine, data = train_data_filtered)
#make_category*year*max_power + transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data_filtered)
summary(refitted_wip)
##
## Call:
## lm(formula = log(selling_price_in_10k) ~ . + make_category *
## year + make_category * max_power - make - seats - engine,
## data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42853 -0.04443 0.00546 0.05217 0.27263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.752e+01 1.232e+01 -5.479 4.50e-08 ***
## year 3.410e-02 6.083e-03 5.605 2.20e-08 ***
## km_driven_in_10k 1.259e-03 3.329e-04 3.781 0.000158 ***
## fuelPetrol -7.216e-02 2.914e-03 -24.765 < 2e-16 ***
## seller_typeIndividual -1.566e-02 3.982e-03 -3.933 8.49e-05 ***
## seller_typeTrustmark Dealer 2.260e-02 1.664e-02 1.358 0.174520
## transmissionManual -8.703e-03 4.803e-03 -1.812 0.070044 .
## ownerFourth & Above Owner -3.849e-02 1.020e-02 -3.773 0.000163 ***
## ownerSecond Owner -1.448e-02 2.860e-03 -5.065 4.24e-07 ***
## ownerTest Drive Car 6.860e-02 4.802e-02 1.428 0.153218
## ownerThird Owner -2.347e-02 4.952e-03 -4.739 2.21e-06 ***
## fuel_efficiency -3.005e-03 3.889e-04 -7.728 1.31e-14 ***
## max_power 2.163e-03 1.563e-03 1.384 0.166432
## make_categoryLuxury 9.183e+00 1.373e+01 0.669 0.503603
## make_categoryMidrange -8.729e+00 1.233e+01 -0.708 0.478942
## year:make_categoryLuxury -4.343e-03 6.784e-03 -0.640 0.522064
## year:make_categoryMidrange 4.339e-03 6.085e-03 0.713 0.475808
## max_power:make_categoryLuxury -1.442e-03 1.589e-03 -0.907 0.364327
## max_power:make_categoryMidrange 6.366e-04 1.563e-03 0.407 0.683796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08083 on 4983 degrees of freedom
## Multiple R-squared: 0.8416, Adjusted R-squared: 0.841
## F-statistic: 1471 on 18 and 4983 DF, p-value: < 2.2e-16
bic_refitted_wip = step(refitted_wip,direction="both",k=log(n),trace=0)
summary(bic_refitted_wip)
##
## Call:
## lm(formula = log(selling_price_in_10k) ~ year + km_driven_in_10k +
## fuel + seller_type + owner + fuel_efficiency + max_power +
## make_category + max_power:make_category, data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42846 -0.04473 0.00582 0.05266 0.27131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.620e+01 8.944e-01 -85.201 < 2e-16 ***
## year 3.838e-02 4.413e-04 86.959 < 2e-16 ***
## km_driven_in_10k 1.225e-03 3.328e-04 3.680 0.000235 ***
## fuelPetrol -7.144e-02 2.898e-03 -24.652 < 2e-16 ***
## seller_typeIndividual -1.521e-02 3.958e-03 -3.844 0.000123 ***
## seller_typeTrustmark Dealer 2.493e-02 1.664e-02 1.498 0.134131
## ownerFourth & Above Owner -3.911e-02 1.021e-02 -3.831 0.000129 ***
## ownerSecond Owner -1.445e-02 2.862e-03 -5.050 4.57e-07 ***
## ownerTest Drive Car 4.230e-02 4.726e-02 0.895 0.370829
## ownerThird Owner -2.350e-02 4.956e-03 -4.743 2.17e-06 ***
## fuel_efficiency -3.006e-03 3.886e-04 -7.735 1.25e-14 ***
## max_power 2.832e-03 1.280e-03 2.213 0.026965 *
## make_categoryLuxury 4.893e-01 1.076e-01 4.549 5.52e-06 ***
## make_categoryMidrange 6.244e-02 9.167e-02 0.681 0.495801
## max_power:make_categoryLuxury -2.086e-03 1.312e-03 -1.590 0.111923
## max_power:make_categoryMidrange -4.183e-06 1.280e-03 -0.003 0.997392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0809 on 4986 degrees of freedom
## Multiple R-squared: 0.8412, Adjusted R-squared: 0.8407
## F-statistic: 1761 on 15 and 4986 DF, p-value: < 2.2e-16
vif(bic_refitted_wip,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## year 1.957323 1 1.399044 --
## km_driven_in_10k 1.704781 1 1.305672 --
## fuel 1.583004 1 1.258175 --
## seller_type 1.092041 2 1.022256 --
## owner 1.393603 4 1.042359 --
## fuel_efficiency 1.718578 1 1.310945 --
## max_power 1.841547 5 1.062963 make_category
## make_category 1.841547 5 1.062963 max_power
## Other Predictors
## year km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency, max_power, make_category
## km_driven_in_10k year, fuel, seller_type, owner, fuel_efficiency, max_power, make_category
## fuel year, km_driven_in_10k, seller_type, owner, fuel_efficiency, max_power, make_category
## seller_type year, km_driven_in_10k, fuel, owner, fuel_efficiency, max_power, make_category
## owner year, km_driven_in_10k, fuel, seller_type, fuel_efficiency, max_power, make_category
## fuel_efficiency year, km_driven_in_10k, fuel, seller_type, owner, max_power, make_category
## max_power year, km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency
## make_category year, km_driven_in_10k, fuel, seller_type, owner, fuel_efficiency
plot(bic_refitted_wip,which=c(1,2))
bptest(bic_refitted_wip)
##
## studentized Breusch-Pagan test
##
## data: bic_refitted_wip
## BP = 674.62, df = 15, p-value < 2.2e-16